To Discount or Not to Discount in Reinforcement Learning: A Case Study Comparing R Learning and Q Learning

Author

Sridhar Mahadevan

Abstract

Most work in reinforcement learning (RL) is based on discounted techniques, such as Q learning, where long-term rewards are geometrically attenuated based on the delay in their occurrence. Schwartz recently proposed an undiscounted RL technique called R learning that optimizes average reward, and argued that it was a better metric than the discounted one optimized by Q learning. In this paper we compare R learning with Q learning on a simulated robot box-pushing task. We compare these two techniques across three different exploration strategies: two of them undirected, Boltzmann and semi-uniform, and one recency-based directed strategy. Our results show that Q learning performs better than R learning, even when both are evaluated using the same undiscounted performance measure. Furthermore, R learning appears to be very sensitive to the choice of exploration strategy. In particular, a surprising result is that R learning's performance noticeably deteriorates under Boltzmann exploration. We identify precisely a limit cycle situation that causes R learning's performance to deteriorate when combined with Boltzmann exploration, and show where such limit cycles arise in our robot task. However, R learning performs much better (although not as well as Q learning) when combined with semi-uniform and recency-based exploration. In this paper, we also argue for using medians over means as a better distribution-free estimator of average performance, and describe a simple non-parametric significance test for comparing learning data from two RL techniques.
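For readers unfamiliar with the two techniques, the following is a minimal Python sketch of the update rules and the two undirected exploration strategies named in the abstract. It follows the commonly cited textbook forms of Q learning and of Schwartz's R learning and is not taken from the paper itself; the function names, the tabular list-of-lists representation, the step sizes `beta` and `alpha`, the discount `gamma`, and the choice to test greediness before the value update are all assumptions made for illustration.

```python
import math
import random


def boltzmann_action(values, temperature, rng=random):
    """Sample an action with probability proportional to exp(value / temperature)."""
    m = max(values)  # subtract the max for numerical stability
    weights = [math.exp((v - m) / temperature) for v in values]
    return rng.choices(range(len(values)), weights=weights, k=1)[0]


def semi_uniform_action(values, p_greedy, rng=random):
    """Take the greedy action with probability p_greedy, otherwise a uniformly random one."""
    if rng.random() < p_greedy:
        return max(range(len(values)), key=values.__getitem__)
    return rng.randrange(len(values))


def q_update(Q, s, a, r, s_next, beta=0.1, gamma=0.9):
    """Discounted Q learning: move Q[s][a] toward r + gamma * max Q[s_next]."""
    target = r + gamma * max(Q[s_next])
    Q[s][a] += beta * (target - Q[s][a])


def r_update(R, rho, s, a, r, s_next, beta=0.1, alpha=0.01):
    """Undiscounted R learning: values are relative to the average reward rho.

    Returns the (possibly updated) average-reward estimate rho.
    Assumption: rho is only updated when the action taken was greedy,
    checked here before R[s][a] is changed.
    """
    best_here = max(R[s])
    best_next = max(R[s_next])
    was_greedy = R[s][a] == best_here
    R[s][a] += beta * (r - rho + best_next - R[s][a])
    if was_greedy:
        rho += alpha * (r + best_next - best_here - rho)
    return rho
```

The only structural difference between the two updates is that Q learning attenuates the next-state value by `gamma`, while R learning subtracts a running estimate of the average reward instead. Because `rho` is updated only on greedy steps, its quality depends on how often the exploration strategy selects greedy actions, which is one plausible reason the choice of exploration strategy matters for R learning.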

Related articles

On-Line Learning of a Persian Spoken Dialogue System Using Real Training Data

The first spoken dialogue system developed for the Persian language is introduced. This is a ticket reservation system with Persian ASR and NLU modules. The focus of the paper is on learning the dialogue management module. In this work, real on-line training data are used during the learning process. For on-line learning, the effect of the variations of discount factor (g) on the learning speed...

Average Reward Reinforcement Learning: Foundations, Algorithms, and Empirical Results

This paper presents a detailed study of average reward reinforcement learning, an undiscounted optimality framework that is more appropriate for cyclical tasks than the much better studied discounted framework. A wide spectrum of average reward algorithms is described, ranging from synchronous dynamic programming methods to several (provably convergent) asynchronous algorithms from optimal co...

Average Reward Reinforcement Learning: Foundations, Algorithms, and Empirical Results Editor: Leslie Kaelbling

Sensitive Discount Optimality: Unifying Discounted and Average Reward Reinforcement Learning

Research in reinforcement learning (RL) has thus far concentrated on two optimality criteria: the discounted framework, which has been very well studied, and the average-reward framework, in which interest is rapidly increasing. In this paper, we present a framework called sensitive discount optimality which offers an elegant way of linking these two paradigms. Although sensitive discount optima...

Publication date: 1994
